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ABSTRACT 

This  project  investigates  a  statistical  method  for  analyzing  the  error  on  predictions 
made  through  the  process  of  time-delay-embedding  of  chaotic  time  series. 

When  viewed  as  a  time-series,  chaotic  data  appears  to  be  unpredictable  and 
random.  A  chaotic  system  actually  has  an  orderly  representation  when  viewed  in  its 
proper  state  space  (the  space  consisting  of  the  pertinent  variables  of  the  system.)  A  very 
remarkable  result  from  the  study  of  chaotic  dynamical  systems  shows  that  present  in 
almost  any  single  time  series  is  information  from  all  the  variables  of  the  state  space.  The 
technique  of  time-delay-embedding  provides  a  method  for  making  predictions  on  the 
evolution  of  this  time  series. 

In  this  method  of  prediction,  one  must  choose  a  parameter  k,  the  number  of  near 
neighbors  in  phase  space  to  fit  the  model  to.  This  project  answers  the  question  by 
describing  an  algorithm  for  determining  the  largest  k  such  that  the  model  adequately  fits 
the  data.  A  prediction  is  then  made  from  this  model  along  with  confidence  intervals 
which  measure  the  reliability  of  the  expected  response.  While  this  project  involved  many 
different  data  sets,  the  purpose  was  not  to  analyze  these  specific  data  sets,  but  to  develop 
a  general  algorithm  which  could  theoretically  be  used  on  any  chaotic  system. 


KEYWORDS:  chaos,  time-series  embedding,  prediction,  confidence  intervals, 
dynamical  systems,  noise. 
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1.  INTRODUCTION 

1.1.  A  COMPARISON  OF  TWO  TIME  SERIES 

A  great  deal  of  work  has  been  done  to  study  erratic  time  series.  In  Figures 
(1.1)  and  (1.2),  we  present  two  different  time  series,  one  from  an  actual  chemical 
experiment,  the  other  is  from  a  hypothetical  experiment. 

The  time  series  in  Figure  (1.1)  was  generated  from  the  Belousov-Zhabotinski 
(BZ)  reaction  in  a  continuous-flow  stirred-tank  reactor.  The  data  was  collected  in 
a  forty  hour  laboratory  experiment  by  Dr.  Milos  Dolnik  at  Brandeis  University,  De¬ 
partment  of  Chemistry  and  Center  for  Complex  Systems.  Instead  of  monotonically 
approaching  an  equilibrium  state,  the  concentrations  of  the  reactants  oscillate  spo¬ 
radically.  The  time  series  in  Figure  (1.2)  was  generated  from  a  random  source  to 
provide  an  example  of  a  stochastic  time  series. 


Figure  1.1;  Time  series  from  BZ  reaction. 

Suppose  that  we  suspect  the  BZ  reaction  to  be  deterministic.  How  could 
we  predict  the  evolution  of  the  system  from  a  current  state?  One  way  would  be 
to  model  the  system  based  on  theoretical  considerations  of  chemistry  and  records  of 
actual  data.  The  BZ  reaction  has  been  modeled  with  three  nonlinear  ODE’s  (Gyorgyi 
and  Field,  1992).  In  science  such  an  approach  is  desirable,  but  not  always  possible, 
especially  in  a  case  where  all  one  has  is  a  time  series,  without  knowledge  of  the  system 
which  produced  it. 
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Figure  1.2:  Time  series  generated  from  a  stochastic  source. 


We  will  use  techniques  from  recent  developments  in  nonlinear  time  series 
analysis  to  make  predictions  on  the  evolution  of  this  time  series.  These  methods 
provide  us  a  very  general  approach  to  this  analysis.  We  will  not  use  theoretical 
considerations,  nor  time  series  from  the  other  reactants,  to  generate  a  global  model 
of  the  system.  The  method  revolves  around  producing  a  delay  plot,  which  is  a  plot 
of  the  dependent  variable  versus  itself  at  previous  times. 

In  Figure  (1.3),  r  is  the  time  interval  between  data  samples.  The  fact  that 
this  representation  of  the  data  appears  to  follow  a  curve,  leads  us  to  believe  that 
the  data  is  generated  from  a  deterministic  process  in  such  a  way  that  each  sample  is 
determined  by  the  previous  sample:  =  /(a;„).  Note  the  difference  between  the 

delay  plot  of  the  BZ  data  with  the  delay  plot  of  the  stochastic  data  in  Figure  (1.4). 

Judging  from  this  representation  of  the  second  time  series,  there  is  not  any 
connection  between  one  sample  and  the  next.  In  general,  this  is  how  stochastic  data 
will  appear.  The  delay  plot  will  appear  as  a  cloud  of  data  points. 

The  natural  way  to  make  a  prediction  on  the  evolution  of  the  first  system 
would  be  to  draw  a  curve  through  the  data,  plug  in  a  value  on  the  x-axis,  and  read 
off  the  value  on  the  y-axis  from  the  curve.  It  is  important  to  remember  that  the 
first  time  series  was  not  generated  to  provide  an  example  of  a  procedure  which  we 
could  theoretically  apply  to  a  physical  system;  the  data  was  produced  from  a  physical 
laboratory  experiment. 
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1.2.  OUR  CONTRIBUTION  TO  THE  PHYSICAL  SCIENTIST’S  TOOL¬ 
BOX 

Say  we  are  given  a  time  series  and.  are  asked  to  make  a  prediction  on  its  future 
evolution.  First,  we  would  need  to  embed  the  time  series  in  delay  coordinates. 
Determining  a  good  embedding  from  a  time  series  is  not  a  trivial  task.  Current 
literature  addresses  techniques  to  select  the  time  delay  and  the  embedding  dimension 
(i.e.  number  of  delay  axes)  (Kantz  and  Schreiber,  1997;  Abarbanel,  1996).  In  this 
project,  we  assume  that  we  have  a  good  embedding  (time  delay  embedding  will  be 
discussed  in  detail  in  section  2.5).  The  state  of  the  art  is  to  linearly  regress  the  k 
nearest  neighbors  of  the  point  to  be  predicted.  How  best  to  choose  k  is  not  addressed 
in  the  literature  (Kantz  and  Schreiber,  1997;  Abarbanel,  1996).  We  would  like  to 
bring  a  statistical  approach  to  this  dynamical  systems  question.  The  tool  we  develop 
will  do  the  following. 

1)  We  will  select  the  appropriate  number,  k,  of  near  neighbors  such  that  the 
model  adequately  fits  the  data.  There  is  a  trade-off  between  choosing  a  large  k  vice 
a  small  k.  The  number  of  near  neighbors  we  choose  for  prediction  will  dictate  the 
size  of  the  neighborhood  to  which  we  fit  our  polynomial.  Taylor’s  Theorem  tells 
us  to  choose  a  small  k  to  minimize  the  local  truncation  error  on  the  polynomial 
approximation.  Because  the  data  contains  noise,  we  would  like  to  use  a  large  k  in 
order  to  minimize  the  effect  the  errors  have  on  the  polynomial  fit.  Using  data  from  a 
large  neighborhood,  though  requires  a  higher  order  polynomial,  which  then  requires 
more  data.  We  determine  the  largest  k  such  that  the  model  is  appropriate  for  the 
data. 


2)  We  will  put  confidence  hounds  on  the  prediction  such  that  we  are  able 
to  deal  with  an  interval  prediction  versus  a  point  prediction.  Current  technology 
allows  time  series  embedding  prediction  to  provide  a  point  prediction,  but  does  not 
provide  a  statement  about  the  quality  of  that  prediction  (Abarbanel,  1996;  Kantz  and 
Schreiber,  1997;  Sauer,  1994).  By  providing  a  region  which  we  are,  say,  95%  certain 
the  true  value  of  the  prediction  lies  within,  we  are  providing  a  means  to  measure  the 
reliability  of  the  prediction.  The  model,  along  with  the  data  to  which  it  is  fit,  will 
provide  the  confidence  interval.  Since  our  model  is  well  fit,  the  confidence  interval 
wiU  be  appropriate. 
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2.  CHAOTIC  DYNAMICAL  SYSTEMS 

2.1.  DYNAMICAL  SYSTEMS 

In  order  to  talk  about  the  statistical  analysis  done  on  chaotic  dynamical  systems, 
we  must  first  describe  what  we  mean  by  a  dynamical  system.  We  begin  with  an 
intuitive/physical  description,  and  then  give  a  rigorous  mathematical  definition. 

It  is  enlightening  to  compare  a  dynamical  system  to  an  experiment  involving 
a  physical  system.  This  system  could  be  a  pendulum,  a  kettle  of  water,  a  billiard 
table,  or  an  internal  combustion  engine.  In  this  experiment,  we  are  interested  in 
observing  how  certain  quantitative  variables  evolve  over  time.  For  the  pendulum, 
these  variables  might  be  the  position  and  velocity  of  the  bob.  For  the  internal 
combustion  engine,  these  might  be  oil  temperature  and  pressure,  temperature  of 
engine  block,  and  rate  of  fuel  consumption.  We  start  the  experiment  running  and 
observe  what  happens  to  the  variables  we  are  interested  in.  A  dynamical  system  is 
defined  to  be  deterministic,  i.e.  there  are  no  random  influences  on  the  system.  So  if 
we  start  our  experiment  in  a  certain  state,  collect  some  data,  and  then  restart  it  in 
the  original  state,  and  collect  a  second  set  of  data,  the  two  sets  of  data  will  match 
exactly.  Remember  that  this  is  not  our  definition  but  only  an  analogy.  Now  we 
present  the  rigorous  definition. 

We  define  an  n-dimensional  dynamical  system  to  be  3^"  (the  n-dimensional 
real  numbers)  along  with  a  function  /  (the  evolution  rule)  which  describes  how  any 
state  vector  x  G  5?”  evolves  in  time.  We  will  define  two  different  types  of  dynamical 
systems.  A  discrete  dynamical  system  is  one  in  which  time  is  considered  to  have  a 
smallest  increment.  The  evolution  rule  is  a  continuous  function  which  maps  each  state 
vector  to  the  state  vector  one  time  unit  later.  We  will  describe  the  state  of  the  system 
at  time  t  by  a  vector  G  3?".  Let  Xq  G  3?"  describe  the  initial  state  of  the  system, 
then  for  every  t  G  N  (natural  numbers)  inductively  define  Xj  by  applying  /  to  Xt_i, 
i.e.  x^  =  /(xt_i).  We  define  the  trajectory  of  the  dynamical  system  starting  from  an 
initial  condition  Xq  as  the  infinite  sequence  {xq,  Xi  =  /(xq),  X2  =  /(xi), ...}  =  {xt}gQ. 
The  trajectory  can  be  thought  of  as  the  path  through  which  the  dynamical  system 
evolves  from  the  initial  condition  Xq. 

The  other  type  of  dynamical  system  is  one  in  which  we  consider  time  to  be 
continuous.  The  evolution  rule  for  a  continuous  dynamical  system  is  defined  by  an 
ordinary  differential  equation  x  =  F'(x),  where  F  :  3?”  ^  3?"  is  continuous.  The 
state  of  the  system  at  any  time  t  is  denoted  as  x(t).  Given  any  initial  condition 
x(0)  G  3?",  the  trajectory  of  the  dynamical  system  is  the  solution  to  the  differential 
equation  passing  through  x(0). 
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2.2.  CHAOS 

Devaney  defines  a  chaotic  dynamical  system  as  a  system  which  exhibits  the 
properties  of 

1)  sensitive  dependence  to  initial  conditions, 

2)  topological  transitivity,  and 

3)  having  topologically  dense  periodic  orbits  (Devaney,  1989,  p.  50). 

These  properties  are  defined  in  topological  terms.  Say  we  have  a  topological 

set  J  and  a  function  f  :  J  J.  /is  said  to  be  sensitively  dependent  to  initial 
conditions  if  there  exists  a  ^  >  0,  such  that,  for  any  x  E:  J  and  any  neighborhood  N 
of  X,  there  exists  y  E  N  and  n  >  0  such  that  |/”(a;)  —  f^{y)\  >  S.  /is  said  to  be 
topologically  transitive  if  for  any  open  sets  U  and  V  (Z  J  there  exists  a  A;  >  0  such 
that  f^{U)  n  y  ^  0.  Finally,  /  has  topologically  dense  periodic  orbits  if  for  any  open 
set  U  G  J,  there  exists  x  E  U  and  a  A:  >  0  such  that  /^(a;)  =  x  (Devaney,  1989,  p. 
50). 

Intuitively,  these  definitions  say  the  following.  Say,  we  observe  two  different 
trajectories  of  a  chaotic  dynamical  system,  each  beginning  from  initial  conditions  very 
close  together.  Sensitive  dependence  to  initial  conditions  says  that  these  trajectories 
will  eventually  become  far  apart.  A  function  is  topologically  transitive  on  a  set 
if  throughout  the  entire  set  there  exist  initial  conditions  whose  trajectories  wander 
throughout  the  entire  set.  A  fimction  has  topologically  dense  periodic  orbits  on  a 
set  if  throughout  the  set  there  are  initial  conditions  which  eventually  evolve  back  to 
themselves.  These  three  properties  imply  that  a  chaotic  dynamical  system  will  be 
highly  unpredictable,  it  will  blend  together,  and  yet  there  will  remain  an  element  of 
regularity. 

Note  that  Devaney’s  definition  of  chaos  applies  only  to  the  discrete  case. 
Appropriate  modifications  of  the  definitions  apply  to  the  continuous  case.  Instead  of 
mapping  a  point  or  neighborhood  a  discrete  number  of  times,  one  allows  a  trajectory 
or  set  of  trajectories  to  evolve  over  time. 

2.3.  THE  LOGISTIC  MAP 

The  logistic  map  is  a  one-dimensional  example  of  a  discrete  chaotic  system.  Let 
fj,  =  [0, 1]  denote  the  unit  interval.  The  logistic  map  is  defined  by  /  :  /x  — >•  //,  such 
that  f{x)  —  Ax{l  —  x).  This  map  has  been  studied  in  great  depth  and  is  known 
to  be  chaotic.  Figure  (2.1)  shows  the  time  series  from  the  logistic  map  with  initial 
condition  Xq  =  0.8.  As  a  time  series  the  data  looks  very  erratic.  Since  the  evolution 
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Figure  2.1:  Data  from  the  Logistic  Map  as  a  time  series. 

rule  maps  each  Xt  to  the  map  on  the  phase  space  (3^)  can  be  represented  in  9?^, 
as  shown  in  Figure  (2.2). 

At  first  sight  this  example  may  appear  trivial.  We  generated  data  using  a 
parabolic  map,  so  of  course  we  can  represent  the  data  in  an  ordered  fashion,  i.e.  along 
a  parabola  again.  The  point  is  that,  though  there  is  a  simple  map  /  ;  9?  which 

governs  the  evolution  of  the  system,  it  is  not  obvious  when  the  system  is  viewed  as  a 
time  series. 

2.4.  THE  LORENZ  EQUATIONS 

Arguably,  the  most  widely  known  example  of  chaos  comes  from  the  Lorenz  equa¬ 
tions.  In  fact,  people  who  may  know  very  little  about  chaos  overall  stand  a  good 
chance  of  having  heard  of  this  system.  This  chaotic  dynamical  system  is  given  by  an 
ordinary  differential  equation  in  9?^.  The  Poincare-Bendixson  Theorem  (Alligood, 
Sauer,  and  Yorke,  1996,  p.  337)  tells  us  that  we  cannot  have  chaos  on  a  flow  in  9J^,  so 
three  is  the  smallest  dimension  on  which  we  can  have  a  continuous  chaotic  dynamical 
system.  The  Lorenz  equations  are 

X  =  a{y-x), 
y  =  rx  —  y  —  XU 
u  =  —Bu  -f  xy, 
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Figure  2.2:  Delay  plot  of  Logistic  Map  data. 

where  cr  =  10,  r  =  28,  and  B  =  8/3  are  famous  chaotic  parameters.  We  numeri¬ 
cally  integrate  these  equations,  using  the  fourth-order  Runge-Kutta  method  (Burden, 
Faires,  and  Reynolds,  1978,  p.  244),  from  an  initial  condition  of  x(0)  =  (0, 1,0)  with 
a  time  step  At  =  ^  to  produce  the  trajectory  in  Figure  (2.3).  This  trajectory  lies 
along  what  is  known  as  the  Lorenz  attractor.  The  Lorenz  attractor  provides  a  good 
example  of  what  can  happen  to  a  chaotic  trajectory.  The  trajectory  does  not  wind 
through  all  of  37^,  but  limits  on  a  bounded  subset  of 

2.5.  TIME  SERIES  DELAY  EMBEDDING 

In  the  introduction  we  alluded  to  time  delay  embedding,  now  we  will  discuss  it  in 
detail.  Tokens’ s  Delay  Embedding  Theorem  tells  us  that  we  can  reproduce  something 
equivalent  to  the  phase  space  attractor  of  a  dynamical  system  from  only  one  time 
series  of  an  appropriate  observation  variable  (Takens,  1981).  This  is  truly  amazing. 
It  might  seem  as  though  we  would  need  to  know  how  all  the  pertinent  variables  of 
a  dynamical  system  evolve  in  order  to  characterize  the  underlying  dynamics  of  the 
system.  This  is  not  the  case.  One  variable  captures  enough  information  from  the 
other  variables  to  draw  conclusions  about  the  entire  system. 

Say  we  have  a  chaotic  dynamical  system  described  by  a  state  vector  x(t)  G  W. 
Let  ^  ^  5ft,  be  a  scalar  observation  function  which  takes  any  state  vector  to  a 

corresponding  scalar  value.  This  observation  function  usually  projects  a  trajectory 
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Figure  2.3:  The  Lorenz  attractor,  also  known  as  the  Lorenz  butterfly. 

into  one  of  its  axes.  In  other  words  the  function  g  can  “pick  off”  one  of  the  coordinates 
of  the  state  vector.  For  a  given  embedding  dimension  M  and  a  time  delay  r,  we 
define  the  delay  vector  at  time  t  to  be 

Y{t)  =  {g{x{t)),g{yi{t-T)),g{x{t-2T)),...,g(x{t-{M -1)t))). 

So,  Y{t)  is  a  vector  in  3?^  consisting  of  M  observations  of  the  function  g 
taken  at  equal  time  intervals.  Much  work  has  been  done  to  determine  the  best 
method  for  finding  an  appropriate  embedding  dimension  and  time  delay.  It  has  been 
shown  that  for  an  M  >  2n  +  1  (where  n  is  the  dimension  of  the  system),  and  almost 
any  delay  r ,  the  rule  which  describes  the  evolution  of  Y  {t)  is  topologically  equivalent 
to  the  one  for  x(t). 

2.6.  THE  LORENZ  DELAY  PLOT 

The  delay  plot  for  the  logistic  map  is  perhaps  not  enlightening  or  surprising. 
This  is  because  for  the  natural  choice  of  r  1  time  unit,  the  delay  plot  is  identical 
to  a  plot  of  the  map  of  the  evolution  rule  in  phase  space. 

The  delay  plot  for  the  Lorenz  equations  is  very  interesting.  Recall  that  for  the 
Lorenz  equations  n  =  3.  We  would  therefore  be  guaranteed  that  a  delay  dimension 
of  M  >  2n  +  1  =  7  will  work.  In  this  case,  it  turns  out  that  M  =  3  is  sufficient.  We 
denote  the  state  of  the  Lorenz  system  at  time  t  by 


x(i)  =  {x{t),y{t),u{t)). 
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-20  -20  x(t-2r) 


Figure  2.4:  Topological  reconstruction  of  the  Lorenz  attractor  generated  from  only 
the  X  time  series. 

To  make  observations  on  the  first  coordinate,  we  set  g{x.{t))  —  x{t).  r  =  0.05  yields 
the  delay  vector 

Y{t)  =  {x(t),x(t  -  0.05),  a;(i  -  0.1)). 

Figure  (2.4)  shows  a  plot  of  aU  the  delay  vectors.  Note  that  the  attractor  in  the  time 
delay  coordinates  looks  very  similar  to  the  Lorenz  attractor. 

Bear  in  mind  that  the  delay  plot  is  generated  solely  from  the  apparently 
sporadic  x{t)  time  series  in  Figure  (2.5).  We  need  not  consider  the  y{t)  and  u{t) 
time  series,  nor  even  be  aware  of  their  existences,  to  make  the  delay  portrait. 

2.7.  TIME  DELAY  EMBEDDING  PREDICTION 

Time  delay  embedding  is  not  only  used  to  characterize  the  complete  dynamical 
system  in  phase  space,  it  is  also  used  to  make  predictions  on  an  observed  time  series. 
Say  we  have  time  series  {a:(ti)}“=o-  We  will  represent  this  time  series  as  a  discrete 
sequence  of  points.  This  is  reasonable  even  if  the  signal  producing  the  time  series  is 
a  continuous  one,  since  in  an  actual  experiment  we  will  only  be  able  to  sample  the 
signal  a  finite  number  of  times  over  some  finite  time  interval. 

We  then  select  an  M  and  a  r  and  embed  the  time  series  by  considering  each 
delay  vector: 


=  {x{tj),x{tj  -  T),x{tj  -  2r),  ...,x{tj  -  (M  -  l)r)). 
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Figure  2.5:  Chaotic  time  series  from  the  Lorenz  system  used  to  generate  the  delay 
plot. 

Say  we  want  to  predict  the  next  value  of  the  time  series  a;(ta+i),  (since  x{ta)  is  the 
last  point  in  the  time  series.)  Then  in  the  embedding  space  3?^  we  locate  the 
k  nearest  neighbors  to  Y{ta,).  Denote  the  evolution  rule  G  :  9?^  9?^,  such  that 

G(Y (tj))  =  Y (tj-i-i).  Since  we  know  where  each  of  the  k  nearest  neighbors  is  mapped, 
we  can  locally  approximate  G  by  linearly  regressing  these  k  nearest  neighbors.  This 
local  approximation  of  G  is  given  by  Y (to+i)  =  G(Y {to))  =  DG  •  (Y {tg,))  +  b,  where 
DG  is  the  Jacobian  derivative  approximated  by  regression  on  the  k  near  neighbors. 

This  is  the  general  method  currently  found  in  the  literature  (Kantz  and 
Schreiber,  1997;  Abarbanel,  1996;  Sauer,  1994).  Our  contribution  will  be  both 
to  decide  when  the  local  linear  model  is  statistically  significant,  or  a  higher  order 
polynomial  is  needed,  and  to  say  how  confident  we  are  of  the  predictions.  To  do  this 
we  need  to  discuss  the  statistical  methods  we  will  use. 

3.  STATISTICS 

We  now  make  a  diversion  into  statistics  to  review  some  of  the  tools  we  require  in 
our  analysis  of  chaotic  dynamical  systems.  We  drew  heavily  from  Neter,  Wasserman, 
and  Kutner’s  Applied  Linear  Statistical  Models,  and  Walpole  and  Myers’s  Probability 
and  Statistics  for  Engineers. 
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3.1.  CONFIDENCE  INTERVALS 

In  making  a  prediction  we  are  making  an  estimation  of  a  quantity.  We  would 
eventually  like  to  say  how  good  the  estimation  is.  Let  us  get  away  from  predictions 
for  a  while  and  begin  with  a  simpler  idea:  the  mean  of  a  set  of  data.  Suppose  we  were 
given  a  sample  set  of  data  and  asked  to  estimate  the  population  mean  from  the  data. 
We  could  sum  the  data  and  then  divide  by  the  number  of  data  points.  Would  this 
sample  mean  provide  a  good  estimate  of  the  population  mean?  It  certainly  provides 
the  best  point  estimate  of  the  population  mean,  but  that  does  not  provide  an  answer 
to  the  question  of  quality.  Statistics  answers  the  question  by  providing  a  range  of 
estimation  instead  of  just  a  point  estimate.  This  range  is  known  as  a  confidence 
interval. 

Suppose  that  we  are  trying  to  estimate  a  statistical  property  of  a  population 
like  the  mean  and  that  for  any  given  sample  of  the  population,  we  developed  an 
algorithm  which  defined  a  range  of  values.  This  range  of  values  either  includes  the 
true  population  mean  or  it  does  not.  Say  we  have  multiple  samples  of  the  population, 
each  yielding  a  range  of  values  by  the  same  algorithm.  If  we  know  the  population  value 
of  the  statistic  we  are  attempting  to  estimate,  we  can  determine  the  percentage  of 
intervals  which  contain  the  value  of  the  population  statistic.  Suppose  this  percentage 
is  90%.  Then  given  a  random  sample  of  data  from  the  population,  we  will  be  able  to 
determine  a  range  of  values  which  have  a  90%  chance  of  containing  the  population 
statistic.  We  could  then  call  this  range  of  values  a  90%  confidence  interval. 

Now,  the  hypotheses  of  the  above  situation  will  not  apply  for  most  experi¬ 
mentally  obtained  data  sets.  For,  if  we  know  the  true  value  of  the  population  statistic 
which  we  are  attempting  to  estimate,  we  would  not  need  to  estimate  it.  This  idea  of  a 
confidence  interval  is  very  useful  if  we  know  the  distribution  of  the  population  we  are 
sampling.  But,  if  aU  we  have  access  to  is  raw  data,  how  can  we  know  the  distribution 
of  the  population.  For  the  mean  of  the  population,  this  question  is  answered  by  the 
Central  Limit  Theorem. 

3.2.  THE  CENTRAL  LIMIT  THEOREM 

Suppose  that  we  are  taking  independent  random  samples  from  a  population.  Say 
we  take  a  number  of  different  samples,  each  containing  n  data  points.  The  Central 
Limit  Theorem  tells  us  that  as  n  approaches  infinity,  the  distribution  of  sample  means 
will  limit  on  a  normal  distribution,  the  graph  of  which  is  the  famous  bell  shaped  curve 
(we  will  rigorously  examine  the  normal  distribution  in  the  next  section). 

Figure  (3.1)  illustrates  this  fact.  Various  random  samples  of  data  were  taken 


17 


Sample  mean  value 


Figure  3.1:  Histogram  of  sample  means  taken  from  a  population  with  a  uniform 
distribution. 

from  the  uniform  distribution.  Each  sample  of  data  contained  ten  points.  One- 
thousand  samples  were  taken,  and  then  these  sample  means  were  plotted  as  a  his¬ 
togram.  One  should  note  that  the  histogram  is  roughly  bell-shaped. 

3.3.  THE  NORMAL  DISTRIBUTION 

For  a  continuous  random  variable,  the  probability  density  function  (PDF)  is 
defined  as  a  function  such  that  the  area  under  the  function  between  two  values  is  the 
probability  that  a  random  sample  of  the  variable  will  lie  between  the  two  values.  For 
the  normal  distribution,  the  PDF  is  a  bell-shaped  curve.  The  normal  distribution 
has  two  parameters  which  completely  determine  its  shape.  These  parameters  are  the 
mean  (/i)  and  the  standard  deviation  (cr).  The  mean  locates  the  center  of  the  curve, 
and  the  standard  deviation  determines  how  flat  or  sharp  the  curve  is.  The  equation 
for  the  normal  PDF  is 

n{x\ix,a)  =  — ^  ^ 

v27rcr 

Therefore  the  probability  that  a  random  sample  of  X  will  lie  between  Xi  and  X2  is 
given  by  the  following  equation, 

Pr(rci  <X<X2)  =  r 

V  27ra  Jxi 


18 


Figure  3.2:  Confidence  interval  for  normal  distribution.  Confidence  level  is  1  —  o;. 

For  a  given  /i,  a,  and  a,  define  cr)  to  be  the  value  of  x  which  satisfies  the 

following  equation, 

7-00  y/2'Ka 

In  our  applications  we  use  a  computer  program  (Jones,  1993)  to  determine  the  value 
of  Xa{iJL,  a),  for  a  given  a,  fx,  and  a . 

We  now  know  enough  to  determine  confidence  intervals  on  a  normal  random 
variable  X.  Say  we  know  the  values  of  fx  and  cr,  and  we  want  to  determine  a  1  —  a 
confidence  interval  centered  at  x  =  /x.  We  divide  the  area  imder  the  curve  into  three 
sections.  In  the  middle  section  we  have  1  —  a  of  the  area,  which  leaves  a/2  of  the 
area  on  the  left  and  right  (since  the  total  area  under  the  PDF  is  1.)  The  left  and 
right  endpoints  of  the  interval  are  then  L  =  Xa/2{ix,  cr)  and  R  =  a;i_a/2 (/c.,  cr)  as  shown 
in  Figure  (3.2)  (Walpole  and  Myers,  1972). 

3.4.  LINEAR  REGRESSION 

Now  instead  of  having  a  problem  with  one  random  variable,  let  us  suppose  that 
we  have  two  variables,  which  depend  on  each  other.  By  plotting  the  data  on  a 
graph,  we  could  get  a  rough  idea  of  whether  or  not  the  two  variables  have  a  linear 
dependence.  Suppose  that  we  have  the  capability  to  set  a  variable  X  to  several 
different  values  and  then  we  can  measure  the  variable  Y  for  each  value  of  X.  We 
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then  plot  the  data  and,  if  the  data  lies  approximately  along  a  line,  we  would  like  to 
determine  the  line  of  best  fit.  The  statistical  technique  for  determining  this  line  is 
known  as  linear  least  squares  regression. 

We  assume  that  we  can  set  the  independent  variable  X  with  a  high  degree 
of  accuracy.  We  then  observe  Y  and  assume  that  Y  contains  some  random  error. 
We  denote  the  ith  observation  as  {xi,yi).  We  wiU  assume  that  there  is  a  hnear 
relationship  between  Y  and  X,  but  because  of  some  random  error  the  observations 
do  not  lie  exactly  on  a  line.  We  denote  this  relationship  by  the  following. 

Yi  =  X  +  Pxi  +  Ei, 
where  E  is  a  Gaussian  noise  term  with  mean  /i  =  0. 

3.5.  ESTIMATORS  OF  SLOPE  AND  INTERCEPT 

Given  the  collection  of  data  {{xi,yi)  :  i  G  {1,2,  ...,n}}  for  some  n,  we  would 
like  to  know  the  equation  for  the  line  of  best  fit  through  the  data.  We  will  need  to 

determine  a,  the  intercept,  and  b,  the  slope  of  this  line.  The  equation  of  this  line  is 
given  by 


y  =  a  +  bx, 


where  y  is  the  predicted  value  of  y  for  a  given  x. 

The  best  line  is  defined  to  be  the  line  that  minimizes  the  sum  of  the  squares 
of  the  distance  that  each  yi  lies  from  this  line.  We  therefore  want  to  minimize 

SSB  =  f^(y,  -a-  bx,)\ 

1=1 

To  do  this  we  set  =  0  and  solve  for  a  and  b. 


d(SSE) 

da 


n 


-2E(» 

i=l 


-a-  bxi)  =  0. 


n 

-a-  bxi)  =  0, 

i=\ 

n  n  n 

i=l  i=l  i=l 


So, 


20 


Finally, 

Now  for  the  other  partial. 

d{SSE) 

db 


na  +  b'^Xi  =  Y^yi- 

i=l  i=l 


=  -2 ^[(yi  -a-  bxi)x^]  =  0, 


-  axi  -  bxl)  =  0. 

i=l 

Finally, 

a  ^2,  Xi-\-b  ^22  =  ^2 

i=l  t=l 

These  two  equations  are  known  as  the  normal  equations  for  this  model.  Solving  for 
a  and  b  yields, 


6  = 


n  *  SXY  -SX*SY 
n*SX2-{SXY  '' 
a  =  y  —  bx, 


where,  SXY  =  E7==iXiyi,  SX2  =  SX  =  E7=iXi,  SY  -  ELi^^and  y 

and  X  are  the  average  values  of  the  observations. 

It  is  important  to  keep  in  mind  that  y  is  a  random  variable.  Therefore, 
for  different  observations  of  the  E’s  we  will  get  different  data  sets.  Each  different 
data  set  will  yield  a  different  estimate  of  the  true  slope  and  intercept.  Some  of  these 
estimates  will  be  better  than  others.  Therefore,  it  is  useful  to  think  of  the  regressed 
slope  and  intercept  as  random  variables  also.  Given  a  set  of  data  it  would  therefore 
be  useful  to  know  how  well  the  regressed  sample  values  (a,  b)  approximate  the  true 
population  values  (A,/5).  To  do  this,  we  must  introduce  a  new  distribution. 


3.6.  THE  t-DISTRIBUTION 

In  order  to  determine  confidence  intervals  on  the  estimates  of  the  slope  and 
intercept  of  the  line  of  best  fit  we  must  use  the  {Student)  t-distribution.  The  PDF 
of  the  t-distribution  is  given  as  follows. 
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Figure  3.3:  The  i-distribution  for  various  degrees  of  freedom,  v.  As  v  increases  the 
t-distribution  limits  on  the  normal  distribution. 

where  v  is  the  number  of  degrees  of  freedom  and  r(2;)  is  the  gamma  function  defined 
as 

POO 

r(.2)  -  /  af-^e-^dx. 

Jo 

(Note  that  r(n)  =  {n—  1)!,  for  any  n  G  {1, 2, ...}.)  The  t-distribution  is  used  when 
one  does  not  know  the  standard  deviation  of  the  population  that  is  being  sampled  and 
the  sample  size  is  too  small  to  assume  that  the  sample  standard  deviation  provides 
a  good  estimate  of  the  population  standard  deviation.  The  t-distribution,  like  the 
normal  distribution,  is  bell-shaped.  As  the  number  of  degrees  of  freedom  approaches 
infinity,  the  t-distribution  approaches  the  normal  distribution  (see  Figure  (3.3)). 

In  a  similar  manner  as  we  did  for  the  normal  distribution,  we  will  define 
as  the  value  of  t  such  that  a  of  the  area  hes  to  the  left  of  It  is  important  to 

note  that  only  makes  sense  if  we  know  the  associated  degrees  of  freedom  for  the 
distribution  (Walpole  and  Myers,  1972).  Again  we  use  a  computer  algorithm  to  find 
its  value  (Jones,  1993). 

3.7.  APPROXIMATION  OF  VARIANCE 

We  use  the  t-distribution  when  we  do  not  know  the  true  value  of  the  population 
variance  cr^.  Instead  of  cr^,  we  use  its  unbiased  estimate  MSE  (mean  squared  error). 
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Say  we  have  a  set  of  data  {xi,  yi).  We  fit  the  data  with  a  line  y  =  a  +  bx.  For  each 
Xi,  we  determine  the  predicted  response  yi  =  a +  bXi.  The  sum  of  the  squares  of  the 
errors  SSE  is 

SSB  =  -  iif. 

t=l 

The  mean  squared  error,  defined  as 


MSE  = 


SSE 
n  —  p’ 


where  p  is  the  number  of  parameters  which  were  estimated  in  the  regression  (in 
this  case  p  —  2),  provides  an  tmbiased  estimate  of  the  population  variance  (Neter, 
Wasserman,  and  Kutner,  1974,  p.  50). 


3.8.  CONFIDENCE  INTERVALS  ON  THE  ESTIMATED  SLOPE  AND 
INTERCEPT 

Suppose  we  have  a  linear  model  which  generates  data.  Then  for  each  set  of  n 
observations,  we  will  be  able  to  generate  a  confidence  interval  around  our  sample  a 
and  b.  We  expect  that  the  appropriate  percentage  of  these  intervals  will  contain  the 
true  value  of  A  and  {3.  The  formulas  for  the  (1  —  «)  confidence  intervals  are  as  follows. 


'sfSxx 


<P  <b  + 


^1— q;/2® 

y/Sxx 


a 


SX2  ^  ^  ^  ^  ^l-a/2Sy/SX2 

'\/tiSxx  y^^Exx 


where  s  is  the  estimated  standard  deviation. 


s  = 


Sxx  =  -  xf  =  SX2 

i=l 


SX^ 
n  ’ 


and  the  t  distribution  has  v  =  n  — 2  degrees  of  freedom. 
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3.9.  MEAN  PREDICTED  RESPONSE  VS.  SINGLE  PREDICTED  RE¬ 
SPONSE 

The  true  response  of  Y  on  Xq  is  the  value  of  Y  the  model  will  return  to  us  if 
we  input  Xq.  If  all  we  have  is  a  set  of  data,  and  we  do  not  know  the  model  from 
which  the  data  was  generated,  then  we  would  like  to  estimate  the  true  response  of 
Xq  with  a  predicted  response  in  the  form  of  a  confidence  interval.  We  can  look  at  a 
single  response  or  a  mean  response.  A  single  response  is  what  we  expect  the  model 
to  return  from  the  input  of  one  single  xq.  The  mean  predicted  response  is  what  we 
would  expect  the  model  to  return  on  average. 

For  an  estimate  a  and  b,  the  point  predicted  response  of  xq  is  po  =  a+bxQ.  Of 
course  there  is  an  associated  confidence  interval  for  the  true  response  to  Xq,  denoted 
Vo- 

!A)“(i-a/2t/v5E(l  +  i  +  <  ,^,  <  +  i  + 

V  ^  ^xx  V  ^  ^xx 

where  again  the  t-distribution  has  u  =  n  — 2  degrees  of  freedom.  This  formula  applies 
to  a  single  response  only.  We  can  also  look  at  confidence  intervals  on  an  average 
response  to  xq,  whose  true  value  is  denoted  as  /J'yjxg-  The  formula  for  the  confidence 
intervals  on  the  mean  response  is 

yo  -  +  <y„  +  + 

V  'fl  Oxx  V  ^  ^xx 

The  confidence  interval  for  the  mean  response  will  be  narrower  than  the  confidence 
interval  for  the  single  response.  This  is  because  the  single  response  has  the  gaussian 
error  contribution,  while  in  the  mean  response  the  gaussian  error  averages  out  to 
zero. 

3.10.  THE  NORMAL  EQUATIONS  IN  MATRIX  FORM 

We  will  now  demonstrate  a  more  elegant  way  of  looking  at  least-squares  re¬ 
gression,  which  allows  for  extension  to  the  higher  order  problems  we  consider  later. 
Again,  let  us  suppose  that  we  have  a  collection  of  data  {{xi,yi)  :  i  e  {l,2,...,n}}  for 
some  n.  Define  Y  to  be  an  n  x  1  colixmn  vector  consisting  of  the  piS.  Define  X  to 
be  an  n  X  2  matrix  whose  first  colmnn  is  all  ones  and  whose  second  colmnn  is  the 
XiS.  Finally  define 
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where  6o  is  the  intercept  and  6i  is  the  slope  of  the  line  of  best  fit  to  the  data. 

Let  us  examine  the  matrix  equation 

X'Xb  =  X'Y, 

where  X^  represents  the  transpose  of  the  matrix  X.  Our  notation  to  look  at  the  ith 
row  and  jth  coliunn  of  a  matrix  A  will  be  {A)ij.  X^X  is  a  2  X  2  matrix,  since  X  is 
2  X  n,  and  X  is  n  x  2.  The  first  row  of  X'  is  identical  to  the  first  column  of  X,  both 
of  which  consist  of  n  ones.  Therefore  the  upper  left  entry  of  X'X  is  the  sum  of  n 
ones,  or 

i=l 

The  upper  right  entry  is  the  same  as  the  lower  left  entry,  which  is  the  product  of  a 
row  of  ones  with  a  column  of  x’s.  So, 

(X'X)„  =  (X'X)„  =  i:(l)(x,)  = 

i=l  t=l 

Finally,  the  lower  right  entry  is  a  product  of  the  row  of  x^s  with  the  column  of  x’s. 

(x'x)„  =  =  Lfe)"- 

On  the  right  side  of  the  equation  we  have  X'Y,  which  is  size  2x1. 

(x'Y),,  =  E(i)(yi)  =  Eyi 

i=\  i=l 


Now  we  see 
equations: 


(X'Y)2i  =  jz{xi){yi)  = 

i—\ 

that  the  equation  X'Xb  =  X'Y  expands  to  yield 

n  n 

nbo  +  biY^Xi  =  E?/* 

i=l  i=l 


the  following  two 


n  n  u 

6oE^* + E^i  = 

i=i  t=i  t=i 

These  equations  are  the  normal  equations  introduced  previously  (with  a  minor  re¬ 
naming  of  the  variables).  The  least  squares  estimates  for  the  slope  and  the  intercept, 

b  =  (X'X)-^X'Y, 

assuming  that  (X'X)"^  exists  (Neter,  Wasserman,  and  Kutner,  1974). 
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3.11.  MULTIPLE  INDEPENDENT  VARIABLES 

In  our  discussion  of  linear  regression  thus  far,  we  have  assumed  that  our  data 
was  being  generated  by  the  model  =  /?o  +  (again  with  a  slight  change 

of  notation).  We  can  also  look  at  models  of  more  than  one  independent  variable. 
We  wiU  denote  a  model  of  this  type  by  Yi  =  +  •••  +  Pk^ik  +  Ei, 

where  k  represents  the  number  of  independent  variables  which  determine  Y.  Again 
we  denote  the  least  squares  estimate  to  each  Pj  as  bj.  To  determine  each  bj,  we  form 
the  sum  of  squares  of  the  errors,  differentiate  with  respect  to  each  bj,  and  set  each 
derivative  equal  to  zero.  This  yields  k  +  1  equations  in  A:  +  1  variables. 

71 

SSE  =  J2{yi  -  bo-  biXii  -  ...  -  bkXik)'^ 


For  each 

=  -2^[(yi  -bo-  byXii  -  ...  -  bkXik)Xij]  =  0 

realizing  that  Xio  is  defined  to  be  identically  1. 

So  for  each  j, 

n  n  n  n  n 

bo  ^  ^  Xij  +  bi  ^  XiiXij  +  ...  +  bj  ^  Y  ■■■  Y  bk  ^  ^  XikXij  ^ 

i=l  i=l  t=l  t=l  i=l 

Now  let  US  redefine  some  matrices.  For  each  i  G  {1,2, ...,  n}  and  j  G  {0, 1, 2,  ■■■,k}, 
define  (X)^^.  =  Xij,  the  ith  observation  of  the  jth  independent  variable,  remembering 
that  Xio  =  1. 

Let 


r&oi 


[hi 

With  a  httle  work,  one  can  see  that  the  above  system  of  equations  can  again  be 
represented  by  the  single  matrix  equation  X'Xb  =  X'Y.  So  again  b  =  (X  X)  X'Y. 

3.12.  ANALYSIS  OF  VARIANCE  (ANOVA) 

Recall  that  each  6,  is  itself  a  random  variable.  It  will  be  useful  to  compute  the 
variance  of  each. 
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Again  assume  that  a  true  model  Yi  =  13-^^Xi  +  Ei  is  generating  data.  Assume 

that  each  observation  is  independent  and  has  a  variance  of  (denoted  cr^  = 
Var{Ei)).  Then  Var{Yi)  =  cr‘^,  for  each  i.  Now  we  compute  Var{bo)  and  Var{bi). 
In  order  to  do  this  we  need  the  following  theorem  concerning  the  variance  of  the  sum 
of  random  variables. 


THEOREM: 

Suppose  that  {Yi  |  i  €  {1,2,  ...,n}}  is  a  set  of  independent  random  variables. 
Let  {oj  I  i  e  {1,  2,  ...,?^}}  be  a  set  of  constant  coefficients.  Then 

Var{a^Yi  +  ...  +  aiYi  +  ...a^y^)  =  alVar{Y^)  +  ...  +  a^Var(Yi)  +  ...  +  alVar{Yn). 


So  now  for  Var{bi).  Recall  that 


Since, 


and, 


6i  = 


nEti4-(Er.,a;,) 


.12 


^^{xi  —  x)Yi  =  ^(rEjYi  —  xYi)  =  y^^XjYj  —  x'^^Yi 


i=l 


i— 1 

n  n 


i=l 


t=l 


n 


n  n 


^{xi  -  xf  =  =  E'^i  -  2^E^*  +  E^' 


t=i 


t=i 


n  on 


t=l  i=l  i=\ 

2 


nx 


2=1 


i=l  i=l 

we  can  write  b\  as 

6i  = 


•  1  ^  r  1 

2=1  2=1 


Er.i(2n  -  3!)ri 
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By  representing  bi  as  a  sum  of  the  l^’s, 

{Xi  -  x)Yi  {Xi-  X)Y^  {Xn  -  x)Yn 

~  v^n  =\2  '  ■■■  '  _  ;=\2  "■  frr.  _ 


Etife  -  *)= 


and  by  the  above  theorem  we  derive  that 


Var{bi) 


Now  for  &o-  Recall  that 


So  (by  the  above  theorem) 


bo  =  y-  bix. 


Var{bo)  =  Var{y)  +  x^Var{bi). 


Now  we  compute, 


Var{y)  =  +  -  +  “^n) 


And, 


=  \var{Y{)  +  ...  +  \var{Yi)  +  ...  +  \var{Yn) 

V?  Ti^ 

ncr^  1  o 
=  —  +  ...  H — r  =  —  =  — cr  . 

n 


x^Var(bi)  = 


i/aKW  =  -g  + 


ELife-gr  +  (E?..^i) 
""  Elite 
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4  -  (Eti  +  (E?.,  ^  ^ 

2  E"=l4 


££.i4 

>E?.i(a:i-4“ 


Var{hi)  = 


Y.U{^i-xf 


a;? 

•t  r  fu  \  2  _ 1  *^2 _ 

V^O^to)  -<’  _  (XS-Y^if 

After  that  harrowing  derivation  we  will  again  demonstrate  the  conciseness  of 
using  matrix  notation.  We  define  the  variance  of  a  7i  X  1  column  vector  d  to  be  an 
n  X  n  matrix  where 

{Var{d))ij  =  Cov{di,dj). 

Even  though  we  have  not  defined  cov  (the  covariance  of  two  random  variables),  it  is 
sufficient  to  know  that  Cov{di,di)  =  Var{di).  It  is  true  that  V^ar(b)  =  cr^(X'X) 

We  will  now  show  that  (V'ar(b))ij  =  Varlbo)  and  that  (yar(b))22  =  Var{bi). 


X'X  = 


n  Er=i 

En  2 


a='(X'X)-'  = 


^Er=a^f-(i:?=i^on ^ 


En  2  _  Y^n 


Indeed  our  assertion  that  (Far(b))ii  =  Var{bo)  and  (Far(b))22  =  Var{bi)  is  true 
(Neter,  Wasserman,  and  Kutner,  1974). 


3.13.  CONFIDENCE  INTERVALS  ON  PREDICTIONS 

Matrix  notation  gives  us  a  very  concise  way  of  determining  confidence  intervals 
on  fits  of  data.  Say  we  want  to  determine  the  confidence  interval  on  a  response  to  Xq. 
First  we  define, 

Xh  = 
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Note  that  the  point  response  is  given  by  t/h  =  Vo  =  We  will  show  that  the 

width  of  our  confidence  interval  is 


w  =  ti^a/2y/MSE{K'^{X>X)--^XH)  =  h-a/2\lMSE{-  + 
by  showing  that 


1  ,  (xo-a;)2 


), 


xUX'X)-x.  =  i+(?^, 

^  '^XX 


First,  expand  the  matrix  notation; 

1 


x;(X'X)-'x,  = 


Let  us  focus  on  the  numerator, 

''n  „2 


[1  a:o] 


En  ^2  _  'ipn 

i=l  Z^i=l  *^1 

-  EILl  Xi  n 


■  1 

Xq 

[1  a;o] 


E?=i  x?  -  Er=i 
-  Er=i  Xi  n 


■  1 

Xq 

n 


=  ~  +  nxl 


t=i 


2=1 
n 


=  +  nxl  -  2:r„^Xi 


2=1 


n 


n 


2=1 

n 


=  E*?  -  2^^%^  +  +  ”5  -  2loE^< 


2=1 


n 


2=1 


n  (\^n  \2 

^  ^^2  _  2^^i^lL  +  2n^2  +  nxg  -  2n;ro-^'='  ^ 


2=1 


n 

n 


2=1 


=  ^  a;?  —  2a;  ^  Xj  +  nx^  +  uXq  —  2nxQX  + 

i=l 

=  +  ^(^0  -  xY 


n 


nx^ 


2“1 


So  the  entire  fraction  equals 

Er=i(^t  -  +  n{xo  -  xf  _  Er=i(^i  ~  +  ^(^0  ~ 


nY!i=i{xi-xY 


1.  (xo  -  xY 


n"Y.U{xi-xY' 

thus  establishing  our  use  of  matrix  notation.  So  we  now  have  a  concise  way  of 
computing  confidence  intervals  on  predicted  responses. 
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3.14.  HIGHER  ORDER  POLYNOMIAL  FITS  TO  SINGLE  INDEPEN¬ 
DENT  VARIABLE 

The  method  of  hnear  regression  of  multiple  independent  variables  provides  an 
easy  way  to  fit  a  function  of  higher  powers  of  a  single  independent  variable  to  data. 
We  will  demonstrate  this  in  the  case  of  a  parabolic  model.  Suppose  we  have  data 
generated  from  the  model  Yi  =  /?o  +  l^i^i  +  +  ^i-  Then  we  simply  adjoin  a 

third  column  to  X,  consisting  of  the  square  of  each  Xi,  and  proceed  to  compute 
b  =  (X'X)-'X'Y. 

We  now  illustrate  the  fitting  of.  higher  order  polynomials  to  data.  We  use 
the  model  Yi  =  l+^i  +  2a:f  +  Ei  to  generate  the  following  data. 


i 

Xi 

Yi 

1 

1.0 

3.23 

2 

2.5 

20.54 

3 

3.0 

16.25 

4 

5.2 

73.54 

5 

6.7 

118.76 

6 

7.3 

139.93 

7 

8.1 

179.06 

8 

9.0 

195.49 

So 


■  1  1.0  1.0 
1  2.5  6.25 
1  3.0  9.00 
1  5.2  27.04 
1  6.7  44.89  ’ 

1  7.3  53.29 
1  8.1  65.61 
1  9.0  81.00  _ 


■  3.23 

20.54 
16.25 

73.54 
118.76  ’ 
139.93 
179.06 
195.49 
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Figure  3.4;  True  parabola  which  is  used  to  generate  the  data  versus  parabola  of  best 
fit. 


and  we  compute 


b  =  (X'X)“^X'Y 


-6.78 

4.55 

2.10 


Figure  (3.4)  shows  the  data,  the  true  parabola,  and  the  fit  parabola.  By 
looking  at  the  graph  it  should  appear  that  the  regressed  model  fits  the  data  better 
than  the  true  model.  The  criterion  we  use  to  determine  how  well  a  model  fits  data 
is  the  sum  of  the  squares  of  the  error  (SSE).  For  the  true  model,  SSE  =  478.1, 
while  for  the  regressed  model  SSE  =  315.18,  the  minimum  possible  value  for  an  SSE 
obtained  from  a  parabola  running  through  the  given  data. 

Please  note  that  while  all  the  above  formulas  were  proven  for  either  a  linear 
or  quadratic  fit,  the  formulas  remain  the  same  for  a  polynomial  fit  of  any  degree 
(Walpole  and  Myers,  p.  335).  We  simply  adjoin  colunms  to  X  of  the  original  data 
raised  to  successive  powers. 


3.15.  APPROPRIATENESS  OF  HIGHER  ORDER  FITS  TO  DATA 

Any  analytic  function  has  a  Taylor  expansion.  This  means  that  any  well-behaved 
function  can  be  approximated  by  a  polynomial  function.  So  in  terms  of  regression 
fitting,  we  now  have  the  tools  to  fit  a  polynomial  of  any  order  to  data  generated  from 
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an  analytical  dynamical  system  Xn+i  =  f{xn),  in  a  “small”  neighborhood  of  the  point 
whose  evolution  we  wish  to  predict. 

3.16.  STATISTICAL  TESTING 

Before  we  jump  into  the  following  test,  it  would  be  wise  to  say  a  few  words  about 
general  statistical  hypothesis  testing.  Often  a  statistical  test  consists  of  testing 
whether  to  accept  or  reject  a  statistical  hypothesis.  The  null  hypothesis  is  the 
hypothesis  which  will  be  accepted  unless  there  is  evidence  to  the  contrary,  in  which 
case  we  will  reject  the  null  hypothesis  and  conclude  the  alternative  hypothesis,  which 
is  the  negation  of  the  null  hypothesis.  There  are  two  types  of  errors  which  may  be 
made  in  this  type  of  testing,  usually  denoted  a  Type  I  error  or  a  Type  II  error.  A 
Type  I  error  is  made  when  one  rejects  a  true  null  hypothesis.  A  Type  II  error  is 
made  when  one  accepts  a  false  null  hypothesis.  The  alpha  level  in  this  test  represents 
the  chance  of  making  a  Type  I  error.  There  is  also  a  beta  level  which  represents  the 
chance  of  making  a  Type  II  error.  In  most  tests  of  this  type,  the  alpha  level  is  set, 
but  the  beta  level  is  unable  to  be  computed.  In  determining  confidence  intervals,  we 
do  not  want  to  set  the  alpha  level  too  low,  because  it  makes  the  confidence  interval 
too  large  to  be  useful.  In  a  test  of  this  sort,  if  we  set  the  alpha  level  too  low,  i.e. 
we  choose  to  rarely  reject  the  null  hypothesis,  it  will  drive  the  beta  level  up  (Walpole 
and  Myers,  1972). 

A  useful  analogy  can  be  made  between  this  sort  of  testing  and  a  trial  in  an 
American  court.  In  court  the  null  hypothesis  would  be  that  the  defendant  is  not 
guilty,  since  he  is  “innocent  rmtil  proven  guilty.”  The  alternative  hypothesis  is  that 
the  defendant  is  guilty.  In  the  American  justice  system  the  alpha  level  is  set  to  be 
very  small;  the  idea  being  that  we  rarely  convict  an  innocent  man.  We  accept  the 
fact  that  the  beta  level  may  be  significantly  high,  as  described  by  the  idea  that  we 
would  rather  set  free  ten  guilty  persons  than  convict  one  innocent  man.  A  very 
low  alpha  level  corresponds  to  a  level  of  proof  of  “beyond  a  reasonable  doubt” ,  while 
a  higher  alpha  level  would  represent  “clear  and  convincing  evidence”  or  higher  yet 
“preponderance  of  the  evidence”  which  is  used  in  several  types  of  military  hearings. 

3.17.  THE  t-TEST  FOR  DEGREE  OF  POLYNOMIAL,  FULL  MODEL 
VERSUS  SUB-MODEL 

Say  we  are  given  a  set  of  data  which  appears  to  have  a  slight  curve  to  it.  It 
would  be  nice  to  know  whether  or  not  the  best  fit  parabola  yields  any  more  information 
than  the  linear  fit.  Is  the  “slight  curve”  statistically  significant  enough  to  support 
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a  quadratic  full  model  over  the  simpler  linear  sub-model.  The  t-test  is  just  such  a 
test.  Let  us  suppose  that  we  fit  the  function  =  6o  +  biXi  -|-  b2xj  to  a  set  of  data. 
Let  ^2  =  E{b2),  i.e.  P2  the  expected  value  of  62-  We  will  test  the  following  set  of 
hypotheses: 


Ho  •  P2  —  ^ 
Ha  : 


Where  Hq  is  the  null  hypothesis  and  Ha  is  the  alternative  hypothesis. 

To  do  this  we  will  compute  a  test  statistic  tg  and  compare  it  to  i(i-a/2,n-3)  = 
icrnnp-  The  variable  ts  is  computed  as  follows. 


ts  = 


b2 

s{62} 


where  5(62}  =  y  ^^{^2}  is  the  unbiased  estimate  of  the  standard  deviation  of  62-  s^{&2} 

is  easy  to  compute.  First  we  compute  s^{b}  =  MSE(K'X.)  ^  and  then  pick  off  the 
appropriate  value  from  the  matrix.  If  tg  <  i(i-a/2,n-3),  we  conclude  Hq-  P2  ~ 
the  other  hand,  if  tg  >  ^(l-a/2,n-3),  we  conclude  Ha-  (3 2  ^  0.  (Neter,  Wasserman, 
and  Kutner,  1974) 

We  view  this  procedure  as  the  test  of  a  full  model  versus  its  sub-model.  The 
sub-model  is  valid  when  the  full  model  degenerates  by  having  a  very  small  leading 
coefficient. 


4.  STATISTICS  IN  DYNAMICAL  SYSTEMS 

We  assume  a  low  dimensional  chaotic  dynamical  system,  subject  to  additive 
noise,  Xn+i  =  f{xn)  +  En-  We  use  the  above  ideas  to  select  the  statistically  signifi¬ 
cant  polynomial  model  in  the  neighborhood  of  a  point  Xp  which  we  wish  to  predict  the 
response  f{xp).  Taylor’s  theorem  gives  us  good  reason  to  believe  that  a  polynomial 
will  approximate  the  function  in  a  small  enough  neighborhood.  This  represents  a 
significant  improvement  in  the  current  state  of  technology  in  which  researchers  select 
a  linear  model  and  haphazardly  choose  several  near  neighbors.  We  also  offer  a  state¬ 
ment  on  the  quality  of  the  prediction  in  terms  of  confidence.  These  advancements  in 
dynamical  systems  theory  are  made  by  slight  modifications  on  the  above  tools  bor¬ 
rowed  from  statistics  theory.  What  follows  can  be  considered  to  be  new  technology 
we  are  developing  for  the  physical  scientist. 
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Figure  4.1:  Time  series  from  the  Logistic  Map. 

4.1.  FINDING  K  FOR  A  SPECIFIED  DEGREE  POLYNOMIAL  FIT 

We  now  return  to  the  setting  of  time  delay  embedding  prediction  and  ask  the 
question:  How  do  we  choose  the  best  number  of  near  neighbors  to  use  in  our  regression 
of  the  data?  Say  we  want  to  use  a  linear  fit  to  make  predictions  on  the  time  series 
generated  from  the  logistic  map  with  an  additive  gaussian  error  term.  First,  we 
generate  a  time  series  starting  from  an  initial  condition  of  Xi  =  0.4  and,  for  each 
n  e  {2,  ...,1000},  we  set  Xn  =  f{xn-i)  =  4a:„_i(l  -  Xn-i).  We  add  a  normally 
distributed  random  noise  term  to  each  Xn  producing  the  following  time  series  shown 
in  Figure  (4.1). 

Say  we  want  to  determine  what  point  in  the  time  series  would  follow  Xq  =  0.35. 
First,  we  embed  the  time  series  in  one  dimension  with  a  time  delay  of  one  iterate, 
and  then  we  regress  a  line  from  the  k  near  neighbors  to  Xq.  Now  we  have  to  choose  a 
value  of  k.  Let  us  regress  a  line  from  the  300  near  neighbors  (shown  in  red  in  Figure 
(4.2))  to  Xq.  It  is  apparent  from  Figure  (4.2)  that  the  data  is  significantly  curved, 
and  the  line  of  best  fit  does  not  accurately  fit  the  data  at  Xq. 

Using  the  t-test,  we  compute  that  the  significance  level  for  this  fit  is  ct  fa  0. 
This  means  that  we  are  almost  100%  positive  that  we  are  not  making  a  Type-I  error. 
Which  also  means  that  we  are  almost  100%  positive  we  are  making  a  Type-II  error. 
Recall  that  a  Type-I  error  is  erroneously  calling  linear  data  nonlinear.  The  way  to 
avoid  this  type  of  error  100%  of  the  time  is  never  to  call  any  data  nonlinear.  To  sum 
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Figure  4.2:  Line  of  best  fit,  regressed  from  too  many  near  neighbors, 
it  up,  this  is  a  very  bad  fit. 

Let’s  examine  the  flip-side  of  the  same  coin.  What  if  we  use  too  few  near 
neighbors?  Figure  (4.3)  shows  a  fit  to  the  10  near  neighbors  of  Xq.  To  help  see  the 
problem  Figure  (4.4)  is  a  close  up  of  the  region  around  Xo  =  0.35.  Since  the  slope  is 
clearly  wrong,  we  do  not  want  to  use  a  fit  such  as  this. 

What  happens  in  the  very  small  region  around  Xq  is  that  the  statistical  errors 
become  too  significant  to  determine  the  shape  of  the  attractor.  The  problem  becomes 
worse  if  we  use  a  very  dense  set  of  points  in  a  very  small  interval  of  the  x-axis.  We 
have  been  looking  at  an  attractor  with  N  =  1000  points  on  it.  Now  lets  fill  the 
attractor  with  N  =  100000  points  and  fit  a  line  and  a  parabola  to  the  200  near 
neighbors  of  Xq  =  0.35.  Figure  (4.5)  shows  the  fitted  parabola  and  line.  Figure  (4.6) 
is  the  close-up  view. 

What  is  wrong  with  these  pictures?  From  the  blown  up  picture,  we  see 
that  the  data  appears  as  a  cloud.  The  error  blurs  the  structure.  There  is  a  more 
significant  problem  though.  We  can  choose  to  tighten  up  the  region  on  the  x-axis 
as  much  as  we  like,  but  we  have  no  control  on  the  y-axis,  the  standard  deviation  of 
the  error  controls  this  size.  Looking  at  too  small  a  region  on  the  x-axis  produces  the 
phenomenon  which  we  termed  the  “tall  skinny  box”  problem.  Note  the  x-scale  versus 
the  y-scale.  The  increment  on  the  y-axis  is  about  0.08  units,  while  the  increment  on 
the  x-axis  is  only  0.003  (differing  by  a  factor  of  approximately  30).  Note  that  both 
the  fits  are  not  very  good.  The  slope  of  the  line  is  off  (imagine  a  line  tangent  to 
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Figure  4.7:  Plot  demonstrating  that  the  linear  model  becomes  more  appropriate  as 
the  nearborhood  size  becomes  smaller,  t  >  0  implies  that  the  quadratic  map  must 
be  used,  t  <0  implies  that  the  linear  map  is  appropriate. 

the  attractor)  and  the  parabola  is  too  narrow.  In  even  more  extreme  cases  we  get  a 
computer  error  (much  akin  to  division  by  zero) ,  since  the  data  begins  to  line  up  along 
a  vertical  line. 

In  writing  the  code  for  this  problem  the  first  step  was  to  start  searching  for 
what  we  call  the  critical  value  of  k,  kcr-  We  defined  k^r  as  follows.  Say  we  were 
attempting  to  fit  a  line  to  the  k-near  neighbors.  Then  for  each  value  of  k  we  fit  the 
data  with  a  parabola,  y  =  bo +  biX  +  b2X^.  By  doing  so  we  assume  that  the  data  is 
coming  from  a  parabola,  y  =  /^o  +  We  then  use  the  t-test  to  tell  us 

whether  or  not  P2  =  0-  really  necessary  to  look  for  a  parabola  or  would  a 

line  fit  the  data  just  as  well?  We  call  kcr  the  value  of  k  such  that  for  k  =  kcr,  the 
t-test  says  use  the  line,  but  for  k  =  kcr  +  I,  the  t-test  says  use  the  parabola. 

A  problem  occurred  when  using  large  amounts  of  data.  Look  at  the  logistic 
map  with  N  =  1000,  and  let’s  suppose  that  we  want  to  fit  a  line  through  2:0  =  0.35. 
Consider  Figure  (4.7).  On  the  x-axis  is  k,  the  number  of  near  neighbors  to  xq.  On 
the  y-axis  is  t  =  -  tcomp  (see  Section  3.17).  If  t  >  0,  then  we  conclude  that  we 

must  use  a  parabolic  fit,  but  if  t  <  0  a  linear  fit  will  suffice. 

For  this  set  of  data,  the  way  in  which  we  defined  kcr  is  unambiguous.  Once 
the  graph  crosses  into  the  linear  region  it  stays  in  the  linear  region.  But  for  larger 
amounts  of  data,  kcr  may  be  ambiguously  defined  i.e.  there  is  not  a  single  crossing 
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Figure  4.8:  Plot  of  t  versus  k  for  very  long  Logistic  time  series.  The  attractor  is 
more  densely  filled  than  in  the  previous  figure. 


point,  but  the  graph  may  oscillate  between  quadi'atic  and  linear  regions.  Figure  (4.8) 
is  the  plot  of  t  versus  k,  but  this  time  with  N  =  100000.  Notice  that  by  the  time  we 
have  narrowed  our  region  down  to  the  1000  nearest  neighbors,  we  are  already  in  the 
linear  region.  But  around  k  =  800,  the  graph  reenters  the  quadratic  region.  It  jumps 
back  and  forth  between  the  two  regions  several  more  times.  Therefore,  for  us  to  speak 
of  the  value  of  kcr  is  a  mistake,  as  we  have  defined  it,  k^r  is  not  unique.  Think  of 
it  in  the  following  terms.  Say  we  have  a  region  containing  k^  near  neighbors.  The 
t-test  tells  us  that  we  have  data  which  appears  linear.  When  we  add  the  next  nearest 
neighbor  into  the  analysis,  the  model  needs  the  quadratic  term  (by  definition  of  k^r)- 
It  may  very  well  happen  that  the  next  nearest  neighbor  again  makes  the  data  appear 
linear,  and  will  reverse  the  decision  of  the  t-test  back  to  linear. 

The  two  most  logical  choices  we  have  for  kcr  is  the  largest  value  or  the  smallest 
value  of  k  with  the  defining  property  of  kcr-  Often  scientists  like  to  fit  data  in  as 
small  a  region  as  possible.  This  is  because  the  error  term  on  the  Taylor  series 
approximation  to  the  function  is  governed  by  the  size  of  the  interval.  If  this  were 
our  goal  we  would  use  the  smallest  value  of  kcr-  But  as  we  get  to  very  small  values 
of  k  we  have  already  seen  the  tall  skinny  box  problem  arise.  Also  the  fluctuations 
at  the  extremely  small  values  of  k  can  be  thought  of  as  noise,  in  which  case  it  would 
be  almost  meaningless  to  use  any  of  them.  However,  this  begs  another  question. 
What  is  to  say  that  the  initial  crossing  of  the  function  from  the  quadratic  region  to 
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Figure  4.9:  Histogram  of  kcr  for  the  Logistic  Map. 

the  linear  region  is  not  itself  noise?  It  certainly  varies  with  the  random  error,  but  is 
it  more  a  function  of  the  attractor  or  the  random  error.  Figure  (4.9)  addresses  this 
issue.  Since  kcr  is  distributed  about  some  expected  value,  we  conclude  that  kcr  is 
a  function  of  the  attractor  and  the  point  to  be  predicted,  Xq.  We  therefore  make  the 
following  definition. 

Definition:  Given  a  1-dimensional  embedding  of  a  chaotic  dynamical  system, 
and  an  Xp  from  which  to  predict  f{xp)  based  on  linear  regression  of  degree  m  of 
the  k  nearest  neighbors  to  Xp,  define  kcr  s.s  follows,  kcr  =  max{A;  :  =  0  but 

(^mk+i  0}  where  is  the  coefficient  on  x^  in  the  regressed  model  of  /  about  Xp 
using  k  near  neighbors  to  Xp. 

4.2.  BAND  SIZE  ON  RESPONSES 

Say  we  have  an  embedded  time  series  in  1-dimension  and  we  would  like  to  look 
at  interval  responses  to  f(xp).  Then,  for  any  degree  polynomial  model,  we  have 
determined  a  way  to  choose  the  number  of  near  neighbors  to  use  in  the  fit  of  /.  Is 
there  an  algorithmic  way  to  choose  the  degree  of  the  polynomial?  Let  us  again  look 
at  the  data  from  the  BZ  reaction.  Figure  (4.10)  shows  the  confidence  bands  across 
the  spectrum  of  x's  covered  by  the  embedded  data.  To  make  the  graph  a  grid  of 
XpS  was  selected.  For  each  Xp,  we  determined  kcr  for  a  linear  fit  (i.e.  the  largest 
nmnber  of  near  neighbors  in  which  the  quadratic  fit  degenerates  into  a  linear  fit.) 
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Figure  4.10:  Confidence  bands  from  fitting  local  linear  model  to  BZ  reaction  data. 


This  test  requires  an  a  level,  which  is  the  probability  of  making  a  Type-I  error.  We 
set  a  =  0.1  which  means  that  for  the  k^r  near  neighbors  to  each  Xp,  we  have  a  10% 
chance  of  erroneously  declaring  the  linear  fit  not  statistically  valid.  We  then  fit  the 
95%  confidence  bands  to  each  predicted  response. 

Figure  (4.11)  contains  a  graph  which  is  generated  in  entirely  the  same  way 
except  that  instead  of  using  a  linear  fit  at  each  Xp,  we  use  a  quadratic  fit.  Note 
that  for  most  values  of  Xp  the  bands  from  the  quadratic  fit  appear  tighter  than  the 
linear  fit,  though  there  are  some  places  where  the  linear  fit  appears  smaller.  If 
we  need  to  decide  between  either  the  linear  or  the  quadratic  in  a  global  sense  then 
one  might  use  the  area  between  the  bands  as  the  deciding  factor,  with  the  smaller 
area  representing  more  accurate  prediction  capability.  We  ran  a  rough  numerical 
integration  to  approximate  the  area  between  the  bands  and  found  that  the  area 
enclosed  by  the  linear  fit,  Ai  =  5150,  while  the  area  enclosed  by  the  quadratic  fit  was 
A2  =  4570.  If  we  had  to  settle  on  one  fit  we  would  choose  the  quadratic,  but  we  do 
not  have  to  choose  one.  The  predicted  responses  are  not  global  in  nature,  they  are 
local  to  each  Xp.  Therefore  if  we  are  interested  in  attaining  the  smallest  bands,  it 
might  be  useful  to  determine  what  degree  fit  to  use  based  on  the  size  of  the  confidence 
interval  it  produces.  To  produce  the  global  picture  this  means  that  at  each  Xp  we 
would  keep  the  bands  which  have  the  smallest  width.  Figure  (4.12)  demonstrates 
what  that  would  look  like. 

While  this  procedure  is  advantageous  from  the  point  of  view  of  producing 
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Figure  4.11:  Confidence  bands  from  fitting  local  quadratic  model  to  BZ  reaction  data. 


Figure  4.12:  Confidence  bands  formed  by  pointwise  choosing  the  smaller  of  the  linear 
or  quadratic  bands. 
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smaller  bands,  it  has  some  drawbacks.  The  higher  order  models  allow  us  to  use  a 
larger  region  of  data  since  these  models  can  account  for  curvature  in  the  data.  This 
increases  the  number  of  data  points  that  we  are  fitting  the  model  to  and  the  number 
of  parameters  that  we  need  to  fit.  Therefore,  the  algorithm  takes  up  much  more  time 
to  run  on  the  computer. 

Statisticians  are  often  skeptical  about  fitting  higher  order  polynomials  (above 
cubic)  to  data.  We  recommend  that  for  a  typical  data  set  the  model  be  restricted  to 
the  linear  case  unless  there  is  some  theoretical  consideration  which  suggests  that  the 
data  tends  to  fit  well  by  a  higher  degree  polynomial  fit. 

4.3.  BENCHMARKING  THE  1-D  PROCEDURE 

Now  we  would  like  to  establish  that  the  above  described  procedure  is  giving 
us  correct  interval  predictions.  We  will  look  at  several  different  data  sets  and  run 
slightly  different  tests  on  each. 

4.3.1.  LOGISTIC  DATA 

The  version  of  the  Logistic  Map  we  use  to  test  the  algorithm  is 
iCn+l  =  f{Xn)  +  En  =  3.85  *  Xn{l  -  Xn)  + 

We  generate  a  time  series  by  iterating  this  map  from  an  initial  condition  of  Xq  =  0.7, 
and  at  each  step  we  add  a  normally  distributed  error  with  parameters  fj,  —  0  and 
a  =  0.01  to  each  term.  We  embed  the  data  and  from  the  first  100  values  of  the 
time  series,  we  make  90%  confidence  bands.  This  is  illustrated  in  the  Figure  (4.13). 
In  order  to  test  the  accuracy  of  these  bands,  we  then  look  at  the  next  n  embedded 
points  and  determine  the  percentage  of  points  which  lie  inside  the  bands.  The 
values  for  select  values  of  n  are  given  in  the  chart  in  Table  (4.1). 


n 

100 

1000 

10000 

Pn 

84.0 

87.8 

87.6 

Table  4.1:  For  a  test  series  n  units  long,  is  the  percentage 
of  times  the  true  evolution  of  the  system  fell  within  the  90% 
confidence  bands  generated  from  100  points. 

Next  we  examine  what  happens  as  the  number  of  points  used  to  generate  the 
bands  is  increased.  We  generate  the  bands  shown  in  Figure  (4.14)  by  using  the  first 
1000  points  of  the  series.  When  we  test  the  accuracy  of  these  bands,  we  get  the 
results  shown  in  Table  (4.2). 


Figure  4.13:  The  90%  confidence  bands  generated  from  a  100  point  Logistic  Map  time 
series. 


Figure  4.14:  The  90%  confidence  bands  generated  from  a  1000  point  Logistic  Map 
time  series. 
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Figure  4.15:  The  90%  confidence  bands  generated  from  a  5000  point  Logistic  Map 
time  series. 


n 

100 

1000 

10000 

Pn 

92.0 

90.6 

90.2 

Table  4.2:  For  a  test  series  n  units  long,  is  the  percentage 
of  times  the  true  evolution  of  the  system  fell  within  the  90% 
confidence  bands  generated  from  1000  points. 


Finally  we  use  5000  points  to  generate  the  90%  confidence  bands  shown  in 
Figure  (4.15).  The  results  are  given  in  Table  (4.3). 


n 

100 

1000 

10000 

Pn 

88.0 

89.3 

90.2 

Table  4.3:  For  a  test  series  n  units  long,  is  the  percentage 
of  times  the  true  evolution  of  the  system  fell  within  the  90% 
confidence  bands  generated  from  5000  points. 

We  should  expect  the  accuracy  of  the  prediction  bands  to  get  better  as  the 
number  of  points  used  to  generate  them  is  increased.  The  data  supports  this  to  some 
extent,  though  the  bands  generated  from  1000  points  appear  to  do  as  good  a  job  as 
the  bands  generated  from  5000  points. 
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Figure  4.16:  Lorenz  z{t)  time  series. 


4.3.2.  LORENZ  INTENSITY  RETURN  DATA 

Consider  a  Lorenz  time  series  {z{t)  :  t  >  0}.  Define  tn  as  the  time  at  which 
the  nth  local  maximum  occurs.  Let  /„  =  z{tn).  There  is  a  function  /,  known  as  an 
intensity  return  map,  which  maps  each  In  to  f{In)  —  In+i-  To  see  this,  we  generate 
the  time  series  {z{t)  :  t  G  {0, ...,  100}}  shown  in  Figure  (4.16).  If  we  plot  each  local 
maximum  versus  its  predecessor,  we  get  the  cusp  in  Figure  (4.17). 

We  generate  the  time  series  by  numerically  integrating  the  Lorenz  equations 
(three  ODE’s)  using  the  fourth-order  Runge-Kutta  method  (Burden,  Faires,  and 
Reynolds,  1978,  p.  244).  The  cusp  in  Figure  (4.17)  is  generated  by  choosing  a 

small  time  step  At  =  In  order  to  simulate  noise,  we  increase  the  time  step  to 

=  ^,  and  get  a  much  less  accurate  approximation  of  the  time  series.  Therefore 
the  intensity  return  data  will  contain  more  error  as  shown  in  Figure  (4.18). 

Now,  we  generate  confidence  bands  using  the  first  100  points  from  the  data 
(Figure  (4.19)).  For  the  next  n  points  in  the  sequence  we  determine  what  percentage 
Pn  of  the  data  lies  inside  the  bands  and  display  this  in  Table  (4.4). 


n 

100 

1000 

10000 

20000 

30000 

Pn 

96.0 

96.6 

96.0 

96.0 

95.9 

Table  4.4:  For  a  t( 

of  times  the  true  evolution  of  the  system  fell  within  the  90% 
confidence  bands  generated  from  100  points. 
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Figure  4.19:  The  90%  confidence  bands  generated  from  the  first  100  points  of  the 
successive  maxima  map  from  the  2:  Lorenz  time  series. 


These  bands  appear  to  be  96%  confidence  bands,  instead  of  the  hoped  for  90%. 
Figure  (4.20)  shows  the  bands  made  with  1000  points  of  the  sequence. 


n 

100 

1000 

10000 

20000 

30000 

Pn 

91.0 

91.2 

91.5 

91.8 

91.6 

Table  4.5: 


For  a  tes' 


series  n  units  long,  is  the  percentage 


of  times  the  true  evolution  of  the  system  fell  within  the  90% 
confidence  bands  generated  from  1000  points. 


The  results  in  Table  (4.5)  show  that  the  bands  come  reasonably  close  to  capturing 
90%  of  the  data. 

The  statistics  involved  in  these  tests  assume  that  the  error  on  the  data  is 
from  a  gaussian  distribution.  Instead  of  artificially  adding  a  Gaussian  noise  term, 
this  method  of  simulating  noise  produced  an  unknown  error  distribution.  This  test 
helps  to  validate  the  use  of  this  method  to  sets  on  which  the  distribution  is  either  not 
known  or  possibly  non-gaussian. 


4.3.3.  LASER  DATA 

In  1991,  the  Santa  Fe  Institute  released  several  time  series  for  a  competition 
dealing  with  predicting  the  evolution  of  the  time  series.  One  of  these  data  sets 
is  a  time  series  of  intensity  from  an  NH3-FIR  laser  collected  by  U.  Hubner,  N.  B. 
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Figure  4.20:  The  90%  confidence  bands  generated  from  the  first  1000  points  of  the 
successive  maxima  map  from  the  z  Lorenz  time  series. 


Abraham,  C.  0.  Weiss  and  others  at  PTB  Braunschweig  in  Germany  (Weigend  and 
Gershenfeld,  1993).  Figure  (4.21)  shows  the  first  1000  points  of  the  series. 

Hiibner  et  al.  show  that  the  semiclassical  laser  equations  are  the  same  as 
the  Lorenz  equations  with  the  parameters  chosen  correctly.  This  inspired  us  to  look 
at  the  intensity  return  plot.  Figure  (4.21)  contains  only  the  first  1000  points  of 
the  series.  The  entire  series  has  approximately  10000  data  points.  In  this  series 
there  are  approximately  1300  local  maxima.  The  intensity  return  plot  and  the  90% 
confidence  bands  are  shown  in  Figure  (4.22). 

The  center  section  of  the  data  is  well  grouped  together  with  only  a  few  out¬ 
liers.  The  extremely  low  values,  as  well  as  the  extremely  high  values,  of  /„  map  to 
a  wide  range  of  /n+i’s-  The  size  of  the  confidence  bands  reflect  the  reliability  of  our 
predictions.  In  the  regions  where  our  data  is  dispersed,  the  confidence  bands  are 
.wide.  Where  the  data  is  tightly  grouped,  the  bands  are  narrow.  From  the  size  of 
the  bands,  we  can  separate  the  range  of  values  of  the  /„’s  into  an  interval  which  is 
“predictable”  and  two  intervals  which  are  “unpredictable.” 

While  our  point  prediction  method  does  yield  a  prediction,  our  interval  pre¬ 
dictions  tell  us  where  we  can  trust  the  predictions.  One  might  offer  the  objection 
that  it  is  possible  to  look  at  the  data,  see  where  it  is  “dispersed”,  and  then  decide 
what  predictions  are  trustworthy  and  which  ones  are  not.  Recall  that  one  of  the  mo¬ 
tivations  for  this  project  is  to  provide  an  objective  method  for  determining  confidence 
on  predictions.  What  the  predictions  will  be  used  for  will  determine,  how  accurate 
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Figure  4.21:  NH3-FIR  laser  data  as  time  series. 


Figure  4.22:  Intensity  return  plot  for  the  NH3-FIR  laser  data.  This  data  provides  a 
good  example  of  an  embedding  which  has  regions  of  both  high  and  low  predictability. 
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the  predictions  need  to  be.  Once  a  tolerance  is  decided  upon,  we  can  determine 
(with  this  method)  the  width  of  the  confidence  interval,  and  if  it  exceeds  the  level  of 
acceptability,  then  we  know  not  to  trust  the  prediction. 

Being  able  to  predict  some  of  the  time  is  better  than  never  being  able  to 
predict.  Knowing  when  we  can  make  a  'prediction  on  the  data  is  better  than  not  being 
able  to  tell  the  difference  between  predictable  and  unpredictable.  Consider  the  system 
which  represents  the  “be  all  and  end  all”  for  time  series  prediction;  the  stock  market. 
If  an  analyst  could  recognize  when  the  stock  market  enters  a  region  of  predictability, 
then  he  could  determine  when  to  invest  and  when  to  wait.  Even  though  he  would 
not  be  able  to  make  predictions  all  of  the  time,  he  would  be  able  to  wait  until  the 
data  entered  a  region  of  predictability. 

The  stock  market  is  cursed  (from  a  prediction  point  of  view)  with  a  dimen¬ 
sionality  problem.  Its  dimension  is  so  high  that  we  may  not  have  enough  recorded 
data  to  be  able  to  adequately  cover  the  attractor.  This  would  be  like  trying  to 
determine  the  shape  of  the  Lorenz  attractor  by  covering  it  with  a  handful  of  points. 
The  process  of  time  delay  embedding  looks  back  through  the  record  of  the  time  series 
to  find  the  sections  of  data  which  most  closely  resemble  the  data  near  the  predic¬ 
tion  point.  If  it  takes  a  large  number  of  delays  to  be  able  to  uniquely  represent  a 
point  in  the  phase  space,  then  a  very  large  time  series  is  necessary  to  fill  out  the 
high-dimensional  attractor.  This  drives  up  the  length  of  the  recorded  data  set.  It 
may  be  that  the  stock  market  has  not  been  around  long  enough  to  see  enough  of  its 
attractor. 

This  method  lays  the  ground  work  for  the  more  general  multi-dimensional 
case.  In  one  dimension,  it  is  possible  to  look  at  the  data  and  determine  to  what 
degree  the  data  is  dispersed.  In  the  multi-dimensional  case  it  is  very  difficult,  if  not 
impossible  to  visually  display  the  data  in  an  informative  way.  When  this  algorithm 
is  generalized  to  higher  dimensions,  it  will  provide  a  quantitative  measure  of  the 
reliability  of  a  prediction  even  when  we  cannot  display  the  data  in  2  or  3  dimensions. 

4.4.  TIME  DELAY  EMBEDDING  THE  LASER  DATA 

The  intensity  return  map  for  the  laser  data  does  not  appear  to  embed  nicely  in 
one  dimension.  The  analysis  of  confidence  bands  is  useful  for  demonstrating  that  the 
size  of  the  confidence  bands  in  regions  of  low  predictability  are  large.  We  will  now 
attempt  to  time-delay-embed  the  laser  data. 

Recall  that  the  time  delay  vector  has  the  form 


Y{t)  =  (^(x(t)),5(x(t  -  T)),p(x(t  -  2r)),  ...,g{yi{t  -  (M  -  1)t))), 
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Figure  4.23:  Example  of  a  periodic  signal  being  sampled  at  its  natural  frequency. 

where  x(t)  is  some  unknown  vector  valued  function  consisting  of  all  the  pertinent 
variables  which  drive  the  output  of  the  laser.  Recall  that  the  laser  data  consists  of 
data  observed  during  an  actual  experiment.  As  such,  our  time  series  is  a  discrete 
sequence,  which  we  will  denote  as  {x{t)}^i,  for  some  large  N.  Therefore  our  obser¬ 
vation  function  g  must  map  the  state  vector  to  our  data  points,  5'(x(t))  =  x{t).  We 
will  now  illustrate  some  possible  ways  to  choose  T  and  M. 

4.4.1.  CHOOSING  r 

Consider  the  function  of  a;  =  sin(2Tvt).  How  many  times  should  we  sample  this 
function  to  get  an  adequate  idea  of  what  the  signal  looks  like?  The  period  of  this 
function  is  T  =  1.  If  we  sample  at  this  period,  we  get  a  constant  sequence  and  have 
no  idea  what  is  the  actual  function  as  shown  by  the  graph  in  Figure  (4.23). 

A  rule  of  thumb  is  to  sample  at  T  =  |  (one  quarter  the  natural  period,  in 
the  general  case).  By  doing  this  we  assure  ourselves  of  getting  a  more  accurate 
representation  of  the  signal. 

The  peaks  from  the  laser  data  occur  at  fairly  uniform  intervals.  Contained  within 
the  9093  total  data  points  are  1266  local  maxima.  Therefore  we  approximate  the 
natural  period  to  be  ^  =  7.2  data  points  per  cycle.  We  then  choose  r  =  2.  Figure 
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Figure  4.24:  Each  star  represents  a  discrete  data  point  of  the  signal.  A  time  delay 
of  twice  this  interval  works  well  for  embedding  the  data. 

(4.24)  shows  the  samples  of  the  signal. 

4.4.2.  CHOOSING  M 

Once  we  have  chosen  t  we  must  choose  M  the  embedding  dimension.  To  decide 
the  embedding  dimension  we  use  the  technique  of  finding  false  near  neighbors.  Say 
we  have  an  embedding  of  dimension  d.  Say  that  we  have  two  points  a;  and  y  which 
are  far  apart,  i.e.  they  are  not  near  neighbors  in  the  d  dimensional  embedding.  If, 
when  we  project  x  and  y  into  dimension  d-1,  they  appear  to  be  near  neighbors,  we 
say  that  x  and  y  are  false  near  neighbors  in  the  dimension  d-1  embedding.  The 
presence  of  many  false  near  neighbors  indicates  that  the  data  should  be  embedded  in 
a  higher  dimension.  This  test  requires  an  arbitrary  measure  of  closeness,  e.  If  two 
points  are  closer  together  than  e  in  the  d  - 1  dimensional  embedding  but  farther  apart 
than  e  in  the  d  dimensional  embedding,  they  are  considered  to  be  false  near  neighbors. 
Figure  (4.25)  shows  the  graph  of  false  near  neighbors  shown  as  a  percentage  of  the 
number  of  pairs  of  points  for  several  dimensions  and  e's. 

The  embedding  dimension  appears  to  be  d  =  4  or  5,  since  in  this  embedding  the 
percentage  of  false  near  neighbors  has  dropped  to  almost  zero. 
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Figure  4.25:  Percentage  of  false  near  neighbors  of  embedded  laser  data  for  various 
dimensions  and  epsilons. 


4.5.  WHERE  TO  GO  NEXT 

The  next  step  is  to  look  at  predictions  on  the  multi- dimensional  case  (M  >  !)• 
Let  X{ti)  be  the  delay  vector  at  time  U  for  a  given  time  series  and  embedding, 

X{U)  =  {x{U),x{U  -  r),x{U  -  2t),  ...,x{U  -  (M  -  1)t)). 

We  will  look  for  the  evolution  rule,  G  :  G(X(ti))  =  X(ti+i).  Say  we  want  to 
approximate  G  at  some  particular  X(tn).  We  collect  the  A;  nearest  neighbors  to 
X(ti),  and  regress  a  linear  approximation  for  G.  Again  we  will  use  the  ^arg^t  k  suA 
that  the  sub-model  is  appropriate.  In  this  notation  G  is  a  function  from  ’ 

G  is  a  degenerate  function  though,  since  all  but  one  component  of  X{tn+i)  is  already 
determined  by  X(tn).  This  will  simplify  the  computations  and  we  only  need  to  fit  a 
function  G^  :  G {X(ti))  =  x(ti-^.l)  (Bollt,  1999). 


5.  CONCLUSION 

In  this  project  we  explored  the  procedure  known  as  time  delay  embedding  and  its 
use  in  making  predictions  on  time  series  generated  from  chaotic  systems.  Our  goal 
was  to  determine  a  way  to  make  a  statement  of  confidence  along  with  the  prediction. 
We  had  two  other  constraints.  We  wanted  to  do  this  in  an  algorithmic  way  and  we 
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wanted  the  procedure  to  be  a  general  one  which,  in  principle,  we  could  apply  to  any 
chaotic  time  series. 

The  method  of  prediction  we  used  required  us  to  embed  the  time  series  into 
delay  space.  For  our  purposes,  we  assumed  that  we  are  given  a  good  embedding. 
This  means  that  we  know  the  time  delay,  and  the  embedding  dimension  for  the  time 
series.  To  make  a  prediction  from  a  point,  first  we  found  a  set  of  nearest  neighbors 
to  the  point  in  delay  space.  We  then  regressed  a  model  through  these  neighbors  and 
used  the  model  to  determine  the  evolution  of  the  prediction  point. 

We  showed  that  if  we  ensured  that  our  model  adequately  fit  the  data,  then 
statistics  would  provide  an  accurate  method  of  determining  confidence  intervals.  Our 
question  then  became  how  to  best  choose  k.  The  simplest  model  to  fit  is  the  linear 
one.  A  local  linear  approximation  to  an  analytic  function  is  valid  only  over  a  small 
region.  Since  this  project  only  considered  finite  fixed  data  sets,  k  directly  determined 
the  size  of  the  region  the  model  is  fit  over.  If  k  was  chosen  too  large,  the  data  would 
significantly  curve  away  from  the  model.  This  seemed  to  indicate  that  a  small  k 
would  be  best.  We  did  not  want  k  to  be  too  small  though,  since  the  effects  of  noise 
are  more  pronoimced  in  a  small  data  set.  We  showed  that  a  way  to  balance  these  two 
considerations  was  to  fit  the  largest  k  possible  such  that  the  model  still  sufficiently 
fits  the  data. 

What  does  it  mean  for  a  model  to  adequately  fit  data?  One  meaning  is  that 
the  model  one  degree  higher  actually  degenerates  to  the  original  model.  In  the  case 
of  the  linear  model,  we  would  first  fit  the  quadratic  (full)  model,  then  examine  the 
quadratic  coefficients.  If  they  were  insignificant  then  the  model  degenerates  to  the 
linear  (sub)  model.  In  this  case  we  say  that  the  linear  model  adequately  fits  the 
data.  Statistics  provides  a  way  to  test  these  coefficients  for  significance.  Since  we 
then  had  a  model  which  adequately  fit  the  data,  the  confidence  intervals  of  prediction 
were  accurate. 

In  answering  our  original  question  concerning  confidence  intervals,  we  had 
to  answer  a  more  basic  question  concerning  prediction.  Both  of  these  questions 
were  answered  using  statistics  which  provides  the  algorithm  based  approach  we  were 
looking  for.  It  also  gives  the  benefit  of  treating  the  time  series  as  a  generic  data  set 
without  attempting  to  utilize  any  knowledge  of  the  system  whence  the  data  came. 
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